How to identify language
and genre using zip

By Adok/Hugi

According to an article published in New Scientist, two new applications of zip apart from compression have been found: identifying the language of a text and identifying the style or composer of a piece of music!

How does it work?

As you perhaps know, zip compression uses a dictionary algorithm which scans the target file for repeating byte sequences (e.g. English, German, Russian, ... words) and assigns shorter codes to them, so that each repeated sequence needs to be stored in full only once. The more often a certain sequence of significant length appears in a file, the better the file is compressed. Moreover, the fewer different sequences appear in a file, the better the file is compressed.
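
To get a feeling for this effect, here is a minimal sketch in Python. It assumes that the zlib module of the standard library, which implements the Deflate method commonly used inside zip files, is a fair stand-in for zip itself. The helper name compressed_size and the two sample inputs are made up for this illustration.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Return the number of bytes data occupies after Deflate compression."""
    return len(zlib.compress(data, 9))


# One 20-byte phrase repeated 200 times: many long repeating sequences.
repetitive = b"the quick brown fox " * 200

# The same amount of random bytes: practically no repeating sequences.
random_like = os.urandom(len(repetitive))

print("repetitive :", len(repetitive), "->", compressed_size(repetitive))
print("random-like:", len(random_like), "->", compressed_size(random_like))
```

The repetitive input should shrink to a small fraction of its size, while the random one should hardly shrink at all.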

Therefore, if you append another text in the same language to an existing text file, the resulting file will compress better than if you append a text in a different language. In the first case, both texts probably share a lot of common sequences, while in the second case they most likely do not.

Hence, in order to detect the language of a text using zip, you first take a couple of long text files in various known languages, compress them and write down the compressed size of each. Afterwards you append the unknown text to each of the uncompressed files and compress them again. The smaller the difference in size between the compressed original and the compressed appended file, the more likely the two languages are to be the same.
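
The procedure described above could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: zlib (Deflate) stands in for zip, the function names compressed_size and guess_language are invented for this example, and the two reference texts are tiny toy samples written for the demonstration, whereas a serious attempt would use long reference files.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Return the number of bytes data occupies after Deflate compression."""
    return len(zlib.compress(data, 9))


def guess_language(unknown: str, references: Mapping[str, str]) -> str:
    """Return the reference language whose compressed size grows least
    when the unknown text is appended to it."""
    sample = unknown.encode("utf-8")
    growth = {}
    for language, text in references.items():
        reference = text.encode("utf-8")
        # Compress the reference alone and with the unknown text appended.
        growth[language] = (compressed_size(reference + sample)
                            - compressed_size(reference))
    return min(growth, key=growth.get)


# Toy reference texts (hypothetical, far too short for real use).
toy_references = {
    "english": "The weather in the north of the country is often cold and "
               "wet, and the people who live there are used to the rain. "
               "In the summer the days are long and the children play in "
               "the fields until the sun goes down behind the hills.",
    "german": "Das Wetter im Norden des Landes ist oft kalt und nass, und "
              "die Menschen, die dort leben, sind an den Regen gewöhnt. "
              "Im Sommer sind die Tage lang und die Kinder spielen auf den "
              "Feldern, bis die Sonne hinter den Hügeln untergeht.",
}

print(guess_language("When the rain comes down in the hills, the children "
                     "who live there play in the long wet fields.",
                     toy_references))
```

One thing to keep in mind: Deflate only looks back 32 KB for repeated sequences, so with this sketch only the last 32 KB of a reference text can actually help in compressing the appended text.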

Researchers at the Dutch National Research Institute in Amsterdam wondered whether compression could also be used to detect the genre or the composer of a piece of music. So they took several tunes, including pieces by Beethoven, Miles Davis and Jimi Hendrix, removed any data unrelated to the rhythm and melody of each tune and applied the same procedure as for text files, using the compression program bzip2.
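
How the researchers encoded the tunes is not described here, so the following Python sketch is only a guess at the general idea. It assumes that each tune has been reduced to a sequence of pitch numbers stored as bytes, and it uses the bz2 module of the standard library, which produces bzip2 streams. The names bz_size and growth and the toy "tunes" are made up for this illustration.

```python
import bz2


def bz_size(data: bytes) -> int:
    """Return the number of bytes data occupies after bzip2 compression."""
    return len(bz2.compress(data, 9))


def growth(reference: bytes, candidate: bytes) -> int:
    """Return how many extra compressed bytes are needed when candidate
    is appended to reference. Smaller values suggest more similar tunes."""
    return bz_size(reference + candidate) - bz_size(reference)


# Hypothetical tunes, each one a motif of MIDI-like pitch numbers repeated.
tunes = {
    "tune_a": bytes([60, 64, 67, 64] * 64),
    "tune_b": bytes([60, 64, 67, 64, 60, 64, 67, 72] * 32),
    "tune_c": bytes([40, 43, 45, 46, 45, 43, 40, 38, 52, 50] * 26),
}

# Print the growth for every ordered pair of different tunes.
for ref_name, ref_data in tunes.items():
    for cand_name, cand_data in tunes.items():
        if ref_name != cand_name:
            print(ref_name, "+", cand_name, "->", growth(ref_data, cand_data))
```

With toy data this short, the fixed overhead of bzip2 can easily dominate the numbers, so the differences only become meaningful with realistically long pieces.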

In a test with 12 pieces each of classical, jazz and rock music, the results were reportedly pretty good. The program arranged the pieces in a tree according to how well they compressed together, and 10 of the jazz, 9 of the rock and most of the classical pieces appeared in three distinct branches of that tree.

In addition, the program was asked to sort 32 classical pieces. The pieces by each composer were clustered on a separate branch.

There is already a practical research application of this simple yet smart new technique: music historians have long debated the authorship of some famous pieces (e.g. the Austrian national anthem), and zip might back up some of their arguments. For example, W.A. Mozart had assigned his student F.X. Süssmayr to complete his unfinished pieces after his death. It is therefore unclear which parts of Mozart's last works were composed by the master himself and which originate from Süssmayr. Jeremy Summerly of the Royal Academy of Music in London would like to employ zip to find an answer to this unresolved question.


Adok/Hugi